{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 一.偏差方差是什么   \n",
    "\n",
    "对于模型表现，我们通常首先会从偏差-方差的角度来进行分析，可以简单定义如下：    \n",
    "\n",
    "1）偏差=训练集上的评估指标与最优指标之间的差距；  \n",
    "2）方差=验证集与训练集在评估指标上的差距；   \n",
    "\n",
    "所以我们上一页notebook中的：   \n",
    "偏差=1.0-0.51=0.49   \n",
    "方差=0.51-0.3=0.21   \n",
    "\n",
    "可以发现偏差和方差都不低，且偏差>方差，对于偏差方差的高低不同还有另外两种常见的说法：    \n",
    "欠拟合：高偏差低方差；   \n",
    "过拟合：高方差低偏差"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 二.高偏差低方差（欠拟合）的处理方法   \n",
    "\n",
    "往往是由于模型简单导致的，往增加模型复杂度的方向调整就可以了，通常有些这些方法：    \n",
    "1）加深树深度；    \n",
    "2）增加叶子节点；    \n",
    "3）增加分箱数；    \n",
    "4）添加新特征；   \n",
    "5）添加树数量；   \n",
    "6）减小正则化权重；  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 三.高方差低偏差（过拟合）的处理方法   \n",
    "\n",
    "对于上面6条进行反向操作即可，另外还可以考虑:    \n",
    "1）添加训练数据；   \n",
    "2）添加噪声；  \n",
    "3）baggging集成；   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 四.优先级问题\n",
    "\n",
    "优先降低偏差，增加模型复杂度，原因很简单，如果模型在训练集上表现都很差时，你还能指望它在验证集上的效果会更好嘛？   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 五.调参   \n",
    "在找到我们的问题之后，就为调参指明了方向，对于我们的模型，偏差=0.39，方差=0.21，当然首先考虑降低偏差，根据上面的处理方法，我们至少有6种参数可以尝试，另外有点建议：    \n",
    "1）每次固定其它参数，只调整其中一个，这样方便对照，如果你一次调整2个参数，就不好判断究竟是哪个参数起到了主导作用；   \n",
    "2）如果评估指标没有明显变化时，就可以停止调整当前参数了，避免陷入过（欠）拟合；"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "class DataBinWrapper(object):\n",
    "    def __init__(self, max_bins=10):\n",
    "        # 分段数\n",
    "        self.max_bins = max_bins\n",
    "        # 记录x各个特征的分段区间\n",
    "        self.XrangeMap = None\n",
    "\n",
    "    def fit(self, x):\n",
    "        n_sample, n_feature = x.shape\n",
    "        # 构建分段数据\n",
    "        self.XrangeMap = [[] for _ in range(0, n_feature)]\n",
    "        for index in range(0, n_feature):\n",
    "            tmp = sorted(x[:, index])\n",
    "            for percent in range(1, self.max_bins):\n",
    "                percent_value = np.percentile(tmp, (1.0 * percent / self.max_bins) * 100.0 // 1)\n",
    "                self.XrangeMap[index].append(percent_value)\n",
    "            self.XrangeMap[index] = sorted(list(set(self.XrangeMap[index])))\n",
    "\n",
    "    def transform(self, x):\n",
    "        \"\"\"\n",
    "        抽取x_bin_index\n",
    "        :param x:\n",
    "        :return:\n",
    "        \"\"\"\n",
    "        if x.ndim == 1:\n",
    "            return np.asarray([np.digitize(x[i], self.XrangeMap[i]) for i in range(0, x.size)])\n",
    "        else:\n",
    "            return np.asarray([np.digitize(x[:, i], self.XrangeMap[i]) for i in range(0, x.shape[1])]).T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import lightgbm as lgb\n",
    "from sklearn.metrics import mean_absolute_error, mean_squared_error, auc\n",
    "from pre_process.pre_process_wy1_beforejt3000_joincx import *\n",
    "import toad\n",
    "from sklearn import metrics\n",
    "diretory=\"xxx\"\n",
    "test0 = pd.read_csv(diretory+'xxx.csv', low_memory=False)\n",
    "test0 = test0[(test0['dpt'] == 217) & (test0['renew_flag'] == 1)]\n",
    "claim = pd.read_csv(diretory+'/data/claimnum.csv')\n",
    "claim = claim[['polnum_com', 'num']]\n",
    "claim = claim.rename(columns={'polnum_com': 'policy_no'})\n",
    "test0=pd.merge(test0,claim,on=\"policy_no\",how=\"inner\")\n",
    "test0['y'] = np.where(test0['num']>0, 1, 0)\n",
    "test0=test0.drop(\"num\",axis=1)\n",
    "dat1=test0.copy()\n",
    "#因子筛选\n",
    "dat1 = ppcess_filterCols(dat1, cols='join')\n",
    "#去掉饱和度过低的因子 by 80%\n",
    "na_cols = []\n",
    "for c in dat1.columns:\n",
    "    if dat1[c].isnull().sum()/len(dat1)>0.8:\n",
    "        na_cols.append(c)\n",
    "dat1 = dat1.drop(na_cols, axis=1)\n",
    "jt_cx_cols = ['xxx']\n",
    "dat1 = dat1.drop(jt_cx_cols, axis=1)\n",
    "dat1 = ppcess_dateCols(dat1,print_process_cols=False) # 日期因子格式\n",
    "dat1 = ppcess_toNum(dat1,print_process_cols=False) # 部分因子转换为数值型\n",
    "dat1 = ppcess_delCols_na(dat1,print_process_cols=False) # 删去饱和度低于1%的因子\n",
    "dat1 = ppcess_inpute(dat1,print_process_cols=False) # 部分因子按规则进行空值填充，数值型因子在后面做均值填充\n",
    "dat1 = ppcess_delCols_uniValue(dat1,print_process_cols=False) # 删去全部唯一值的因子\n",
    "\n",
    "dat2 = dat1.copy()\n",
    "dat2 = feature_catValues(dat2,print_process_cols=False) # 修正部分因子取值\n",
    "dat2 = feature_factorLabel1(dat2,print_process_cols=False) # 离散特征编码1\n",
    "dat2 = feature_factorLabel2(dat2,print_process_cols=False) # 离散特征编码2\n",
    "dat2 = feature_lambda(dat2,print_process_cols=False) # 构建一些特征\n",
    "dat2 = dat2.drop(['xxx'], axis=1)\n",
    "dat2['xxx'] = dat2['xxx'].fillna(0)\n",
    "dat2['y']=test0['y']\n",
    "#切分训练、验证、测试\n",
    "trn_df=dat2[test0['effdate'].apply(lambda x:x[:7]<=\"2019-10\")]\n",
    "\n",
    "dev_test_df=dat2[test0['effdate'].apply(lambda x:x[:7]>=\"2019-11\")]\n",
    "indice=list(range(0,dev_test_df.shape[0]))\n",
    "np.random.shuffle(indice)\n",
    "dev_test_df=dev_test_df.iloc[indice]\n",
    "dev_df=dev_test_df.iloc[:dev_test_df.shape[0]//2]\n",
    "test_df=dev_test_df.iloc[dev_test_df.shape[0]//2:]\n",
    "\n",
    "#target encoding\n",
    "object_cols=trn_df.dtypes[trn_df.dtypes==object].reset_index()['index'].tolist()\n",
    "trn_df[object_cols]=trn_df[object_cols].fillna(\"missing\")\n",
    "dev_df[object_cols]=dev_df[object_cols].fillna(\"missing\")\n",
    "test_df[object_cols]=test_df[object_cols].fillna(\"missing\")\n",
    "object_target_cols=[]\n",
    "for col in object_cols:\n",
    "    object_target_cols.append(col+\"_target\")\n",
    "    trn_df[col+\"_target\"]=trn_df[col]\n",
    "    dev_df[col+\"_target\"]=dev_df[col]\n",
    "    test_df[col+\"_target\"]=test_df[col]\n",
    "import category_encoders as ce\n",
    "le=ce.TargetEncoder(cols=object_target_cols)\n",
    "le.fit(trn_df,trn_df['y'])\n",
    "trn_df=le.transform(trn_df)\n",
    "dev_df=le.transform(dev_df)\n",
    "test_df=le.transform(test_df)\n",
    "#ordinary encoding\n",
    "oe=ce.OrdinalEncoder()\n",
    "oe.fit(trn_df,cols=object_cols)\n",
    "trn_df=oe.transform(trn_df)\n",
    "dev_df=oe.transform(dev_df)\n",
    "test_df=oe.transform(test_df)\n",
    "#分箱做一次WOE\n",
    "trn_woe_df=trn_df.drop([\"policy_no\",\"y\"],axis=1).copy()\n",
    "dev_woe_df=dev_df.drop([\"policy_no\",\"y\"],axis=1).copy()\n",
    "test_woe_df=test_df.drop([\"policy_no\",\"y\"],axis=1).copy()\n",
    "woe_cols=[item+\"_woe\" for item in trn_woe_df.columns]\n",
    "dbw=DataBinWrapper()\n",
    "dbw.fit(trn_woe_df.values)\n",
    "trn_woe_df=pd.DataFrame(data=dbw.transform(trn_woe_df.values),columns=woe_cols)\n",
    "dev_woe_df=pd.DataFrame(data=dbw.transform(dev_woe_df.values),columns=woe_cols)\n",
    "test_woe_df=pd.DataFrame(data=dbw.transform(test_woe_df.values),columns=woe_cols)\n",
    "trn_woe_df=trn_woe_df.astype(\"object\")\n",
    "dev_woe_df=dev_woe_df.astype(\"object\")\n",
    "test_woe_df=test_woe_df.astype(\"object\")\n",
    "woe_encoder=ce.WOEEncoder()\n",
    "woe_encoder.fit(trn_woe_df,trn_df['y'],cols=woe_cols)\n",
    "trn_woe_df=woe_encoder.transform(trn_woe_df)\n",
    "dev_woe_df=woe_encoder.transform(dev_woe_df)\n",
    "test_woe_df=woe_encoder.transform(test_woe_df)\n",
    "trn_df=pd.concat([trn_df.reset_index(),trn_woe_df.reset_index()],axis=1).drop([\"index\"],axis=1)\n",
    "dev_df=pd.concat([dev_df.reset_index(),dev_woe_df.reset_index()],axis=1).drop([\"index\"],axis=1)\n",
    "test_df=pd.concat([test_df.reset_index(),test_woe_df.reset_index()],axis=1).drop([\"index\"],axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#自定义metrics\n",
    "def eval_function(y_true,y_pred):\n",
    "    try:\n",
    "        y_pred=y_pred.get_label()\n",
    "    except:\n",
    "        pass\n",
    "    sort_indice=np.argsort(y_pred)[::-1]\n",
    "    metric_value=y_true[sort_indice[:int(0.05*len(y_true))]].mean()\n",
    "    return \"eval_function\",metric_value,True\n",
    "def eval(y_true,y_pred):\n",
    "    return np.round(eval_function(y_true,y_pred)[1]/eval_function(y_true,y_true)[1],2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 调参推荐：https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html\n",
    "# 超参\n",
    "params = {\n",
    "'boosting_type':'gbdt',#集成方式，还包括rf,dart,goss等\n",
    "'objective':'binary',\n",
    "'metric':\"eval_function\",\n",
    "'learning_rate':0.03,#学习率\n",
    "'max_depth':8,#单颗树的最大深度\n",
    "'num_leaves':200,#叶子节点数\n",
    "'max_bins':255,#分箱数\n",
    "'lambda_l1':1e-3,#l1正则化权重\n",
    "'lambda_l2':1e-3,#l2正则化权重\n",
    "# 'min_data_in_leaf':5,#叶子节点最小记录数\n",
    "# 'feature_fraction':0.9,#特征抽取比例\n",
    "}\n",
    "\n",
    "trn_x = trn_df.drop(['policy_no','y'], axis=1)\n",
    "trn_y = trn_df['y']\n",
    "\n",
    "val_x = dev_df.drop(['policy_no','y'], axis=1)\n",
    "val_y = dev_df['y']\n",
    "\n",
    "trn_data = lgb.Dataset(trn_x, trn_y,categorical_feature=object_cols)\n",
    "val_data = lgb.Dataset(val_x, val_y,categorical_feature=object_cols)\n",
    "\n",
    "reg = lgb.train(params = params,\n",
    "                train_set = trn_data,\n",
    "                num_boost_round = 500,#最大树数量\n",
    "                early_stopping_rounds = 20,#如何验证集效果在20轮中没有明显变好，就终止\n",
    "                feval=eval_function,\n",
    "                valid_sets = [val_data])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "训练集预测结果 0.72\n",
      "验证集预测结果 0.46\n",
      "测试集预测结果 0.5\n"
     ]
    }
   ],
   "source": [
    "test_x = test_df.drop(['policy_no','y'], axis=1)\n",
    "test_y = test_df['y']\n",
    "print(\"训练集预测结果\",eval(trn_y.values,reg.predict(trn_x)))\n",
    "print(\"验证集预测结果\",eval(val_y.values,reg.predict(val_x)))\n",
    "print(\"测试集预测结果\",eval(test_y.values,reg.predict(test_x)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这时,    \n",
    "偏差=1.0-0.72=0.28   \n",
    "方差=0.72-0.46=0.26  \n",
    "比较符合预期，偏差下降不少，但方差有所上升，目前偏差似乎也要也要接近极限了，接下来需要考虑降低方差，这内容放到下一页，在这里，我们可以再尝试一下自动调参技术，之前的note中介绍过网格、随机、贝叶斯调参，这里我们尝试一下automl,与人工调参做对比    \n",
    "\n",
    "### 六.AutoML  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#https://microsoft.github.io/FLAML/\n",
    "#https://github.com/microsoft/FLAML\n",
    "from flaml import AutoML\n",
    "automl = AutoML()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def eval_function_automl(X_test, y_test, estimator, labels,\n",
    " X_train, y_train, weight_test=None, weight_train=None):\n",
    "    eval_score=eval_function(y_test,estimator.predict(X_test))[1]\n",
    "    return -1*eval_score,(eval_score,)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%timeit\n",
    "# automl.fit(trn_x, trn_y,X_val=val_x,y_val=val_y, task=\"classification\", estimator_list=[\"lgbm\"],\n",
    "#            metric=eval_function_automl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "训练集预测结果 0.13\n",
      "验证集预测结果 0.05\n",
      "测试集预测结果 0.05\n"
     ]
    }
   ],
   "source": [
    "print(\"训练集预测结果\",eval(trn_y.values,automl.predict(trn_x)))\n",
    "print(\"验证集预测结果\",eval(val_y.values,automl.predict(val_x)))\n",
    "print(\"测试集预测结果\",eval(test_y.values,automl.predict(test_x)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%timeit\n",
    "# automl.fit(trn_x, trn_y,X_val=val_x,y_val=val_y, task=\"classification\", estimator_list=[\"lgbm\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "训练集预测结果 0.13\n",
      "验证集预测结果 0.05\n",
      "测试集预测结果 0.05\n"
     ]
    }
   ],
   "source": [
    "print(\"训练集预测结果\",eval(trn_y.values,automl.predict(trn_x)))\n",
    "print(\"验证集预测结果\",eval(val_y.values,automl.predict(val_x)))\n",
    "print(\"测试集预测结果\",eval(test_y.values,automl.predict(test_x)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%timeit\n",
    "# automl.fit(trn_x, trn_y,X_val=val_x,y_val=val_y, task=\"classification\", estimator_list=[\"lgbm\"],\n",
    "#            metric=\"f1\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "训练集预测结果 0.68\n",
      "验证集预测结果 0.31\n",
      "测试集预测结果 0.34\n"
     ]
    }
   ],
   "source": [
    "print(\"训练集预测结果\",eval(trn_y.values,automl.predict(trn_x)))\n",
    "print(\"验证集预测结果\",eval(val_y.values,automl.predict(val_x)))\n",
    "print(\"测试集预测结果\",eval(test_y.values,automl.predict(test_x)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
